You can use Python string functions to do some basic data cleaning. (For data sets with more complex cleaning needs, you might want to use a power tool like Open Refine.)
Here, we're going to write a function that takes in one row of data as input, cleans up the pieces of data in the row, then returns a cleaned version of that row. As we loop over the data in the file, we'll call this function on each row, then write out the clean row to a new file.
Let's break down our tasks:
We're going to be working with the FDIC's list of failed banks.
First, we need to write our cleaning function -- let's call our function clean_row()
. We need to decide whether the row it parses will be a dictionary (using csv.DictReader
) or a list (using csv.reader
).
Let's use a dictionary.
Here are the fields that we are going to include in our output file. The ones that need cleaning are in bold.
In [ ]:
# first line defines the function and the argument
# ("row" is an arbitrary variable name)
def clean_row(row):
"""
For the bank and institution name:
- strip whitespace
- uppercase the name
- replace '&' with 'AND'
n.b.: you can chain string methods together
"""
clean_bank = row['Bank Name'].strip().upper().replace('&', 'AND')
clean_inst = row['Acquiring Institution'].strip().upper().replace('&', 'AND')
# strip whitespace and upcase the city
clean_city = row['City'].strip().upper()
# return a dictionary of clean data
# the keys ~must~ match the headers of our output file
return {
'bank': clean_bank,
'inst': clean_inst,
'city': clean_city,
'st': row['ST'],
'c_date': row['Closing Date']
}
In [ ]:
# import the csv library
import csv
# open the two files
with open('data/failed-banks.csv', 'r') as infile, open('banks-clean.csv', 'w') as outfile:
# create a DictReader object
reader = csv.DictReader(infile)
# create a DictWriter object
# the fieldnames must exactly match the keys in the dictionary being returned
# by our cleaning function
writer = csv.DictWriter(outfile, fieldnames=['bank', 'inst', 'city', 'st', 'c_date'])
# write out header row
writer.writeheader()
# loop over the rows of raw data
# "row" is an arbitrary variable name
for row in reader:
# call the cleaning function on the row
cleaned = clean_row(row)
# write out the clean row
writer.writerow(cleaned)
The Closing Date
field in the bank failure data is in this format: 6-Sep-2011
. In other words, day, then abbreviated month as text, then year.
Python's built-in datetime
module has two methods that can help us reformat them: strftime()
and strptime()
.
Your task: Add some code to the cleaning function to reformat the closing date in yyyy-mm-dd
format. This will require doing some research into a module that we haven't discussed yet. (Good practice for when you're coding on your own.)
Breaking it down into smaller tasks:
datetime
from the datetime
module: from datetime import datetime
strptime()
to turn a 6-Sep-2011
-type string into a Python date objectyyyy-mm-dd
using strftime()
Google is your friend here. Try searching for things like "python strptime example." (Freebie: Here's a handy guide to the date directives.) Noodle around in a cell. Get something working for one date -- a test string -- before setting your solution loose on the whole file. Try new things, see what happens, fail, find solutions. It's all part of the learning process.
In [ ]:
import csv
from datetime import datetime
def clean_row(row):
clean_bank = row['Bank Name'].strip().upper().replace('&', 'AND')
clean_inst = row['Acquiring Institution'].strip().upper().replace('&', 'AND')
clean_city = row['City'].strip().upper()
# reformat the date
clean_date = datetime.strptime(row['Closing Date'], '%d-%b-%y').strftime('%Y-%m-%d')
return {
'bank': clean_bank,
'inst': clean_inst,
'city': clean_city,
'st': row['ST'],
'c_date': clean_date
}
with open('data/failed_banks.csv', 'r') as infile, open('banks-clean.csv', 'w') as outfile:
reader = csv.DictReader(infile)
writer = csv.DictWriter(outfile, fieldnames=['bank', 'inst', 'city', 'st', 'c_date'])
writer.writeheader()
for row in reader:
cleaned = clean_row(row)
writer.writerow(cleaned)
In [ ]: